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(57) Abstract: A digital signal processor (10) uses a number of independent sub-processors (14, 20, 24, 28, 30) that may be con- 
trolled by a master programmable controller (18). For example, a specialized input processor (14) may process input signals while 
a specialized output processor (20) may process output signals. Each of these processors may also accomplish math funcuons when 
input and output processing is not necessary. The various processors may communicate with one another through general purpose 
registers (16) which receive data and provide data to any of the processors in the system. Math processors (26) may be added as 
needed to accomplish desired mathematical funcuons. In addition, a RAM processor (24) may be utilized to hold the results of inter- 
mediate calculations in one embodiment of the present invention. In this way, an adaptable and scaleable design may be implemented 
that accommodates a variety of different operations without requiring redesign of all the components. 
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Digital Signal Processor Having A 
Plurality Of Independent Dedicated Processors 

Background 

This invention relates generally to digital signal processing and in particular aspects to 
architectures for digital signal processors. 

Digital signal processors generally modify or analyze information measured as 
discrete sequences of numbers. Digital signal processors are used for a wide variety of signal 
processing applications such as television, multimedia, audio, digital image processing and 
telephony as examples. Most of these applications involve a certain amount of mathematical 
manipulation, usually multiplying and adding signals. 

A large number of digital signal processors are available from a large number of 
vendors. Generally, each of these processors is fixed in the sense that it comes with certain 
capabilities. Users attempt to acquire those processors which best fit their needs and budget. 
However, the user's ability to modify the overall architecture of the digital signal processor is 
relatively limited. Thus, these products are packaged as xmits having fixed and irnmutable 
sets of capabilities. 

In a number of cases, it would be desirable to have the ability to create a digital signal 
processor that performs complex functions that are specifically adapted to particular 
problems to be solved. Thus, it would be desirable that the hardware and software of the 
digital signal processor be adapted to a particular function. However, such a digital signal 
processor might enjoy a relatively limited market. Given the investment in silicon 
processing, it may not be feasible to provide a digital signal processor which has been 
designed to meet relatively specific needs. However, such a device would be highly 
desirable. It would provide the greatest perfomiance for the expense incurred, since only 
those features that are needed are provided. Moreover, those features may be provided that 
result in the highest performance without unduly increasing cost. 

Thus, there is a need for digital signal processor which is scalable, and adaptable to 
implementing a variety of unique applications in various configurations. 

Summarv 

In accordance with one aspect, a digital signal processor includes a mathematical 
processor, an input processor and an output processor. The input processor processes input 
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signals to the digital signal processor. The output processor processes output signals from the 
digital signal processor. A master processor controls the mathematical processor, the input 
processor and the output processor. A storage is selectively accessible by each of the 
processors. 

Other aspects are set forth in the accompanying detailed description and claims. 

Brief Description of the Drawings 

Figure 1 is a block diagram of one embodiment of the present invention; 

Figure 2 is a block diagram for one embodiment of the master program controller 
illustrated in Figure 1; . 

Figure 3 is a block diagram for one embodiment of the programmable input processor 
shown in Figure 1 ; 

Figure 4 is a block diagram for one embodiment of the programmable output 
processor shown in Figure 1 ; 

Figures 5 and 6 are diagrams showing the chaining process used in connection with 
the general purpose register shown in Figure 1 ; 

Figure 7 is a block diagram showing one implementation of the programmable RAM 
processor in accordance with one embodiment of the present invention; 

Figure 8 is a block diagram for one embodiment of the programmable math processor, 
shown in Figure 1 , in accordance with one embodiment of the present invention that 
implements additions and subtractions; 

Figure 9 is a block diagram for one embodiment of the programmable math processor, 
shown in Figure 1 , in accordance with one embodiment of the present invention which 
implements a multiply and accumulate operation; 

Figures 10, 1 1 and 12 show data path interface methods that may be utilized in 
connection with embodiments of the present invention; and 

Figure 13 is a flow chart for one embodiment of the present invention. 

Detailed Description 

A digital signal processor 10 may include a plurality of microprocessors 14, 18, 20, 
24 and 26 each having their own instruction sets. The individual processors need not 
communicate directly with one another but instead may communicate through storage 
registers associated with a general purpose register (GPR) 32 that is part of the registers 16. 
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Thus, the results of an operation performed by one of the processors may be stored in the 
GPR 32 for access by another processor. 

Each of the processors may be separately programmed with its own set of codes. The 
instruction sets for each processor may provide the logic for operating the particular functions 
5 for that processor, avoiding the need for separate hardware logic for implementing the 
subprocessor functions, in one embodiment of the invention. 

The master programmable controller (MPC) 18 provides the timing for the other 
processors and operates like an instruction execution controller. Knowing the times to 
execute a given instruction in a given processor, the MPC 1 8 waits for response from a given 
10 processor. In effect then the MPC 18 has instruction sets that enable it to assist others to 
operate on a cycle by cycle basis. Generally, one instruction is executed per cycle. 

Although each of the processors may be independently programmed, the instruction 
sets may be sufficiently similar so that an instruction set for one processor may be modified 
for use in other processors. This may decrease the time for programming each processor. 
15 A programmable input processor (PIP) 14 receives inputs from a receive buffer such 

as a first in first out (FIFO) register 12. The PIP 14 may, in some embodiments of the present 
invention, provide a precision change or a scaling of an input signal. The PIP 14 may be 
advantageous since it may provide for input data processing when input signals are available 
and may provide mathematical operations at other times. The input data need not be 
20 synchronized with the system 10 since the MPC 18 may wait for the PIP 14 to complete a 
given operation. Thus, in effect, the MPC 1 8 provides the synchronization for a variety of 
unsynchronized subprocessor units. 

The programmable output processor (POP) 20 provides outputs to a transmit buffer 22 
such as a first in first out (FIFO) register. The POP 20 may also do mathematical operations 
25 when no output data is available for transmission to the transmit buffer 22. 

The programmable random access memory (RAM) processor (PRP) 20 basically 
stores and retrieves data. It may be particularly advantageous in storing intermediate results, 
in a cache-like fashion. This may be particularly applicable in two-dimensional image 
processing. 

30 Some embodiments of the present invention may use normal length words but other 

embodiments may use the so-called very long instruction word (VLIW). VLIW may be 
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advantageous in some embodiments because the logic may be in effect contained within the 
instructions and glue logic to coordinate the various subprocessors may be unnecessary. 

Since each instruction may have a predetermined execution time independent of the 
data, the MPC 18 can control the various processor operations, on a cycle-by-cycle basis. 
5 Each of the processors is capable of operating in parallel with all of the other processors. 
Thus, in effect, the architecture shown in Figure 1 is a parallel processor; however, the 
architecture is such that the operations are largely broken down on general recurring 
functional bases. 

A number of mathematical processors may be provided within the unit 26 based on 

10 the particular needs in particular applications. In the illustrated embodiment, a pair of 

identical add and subtract programmable mathematical processors (PMP) 28a and 28b are 
combined with a pair of multiply and accumulate (MAC) programmable mathematical 
processors (PMP) 30a and 30b. However, a variety of other mathematical processors may be 
plugged into the digital signal processor 10 in addition to or in place of any of them or all of 

15 the illustrated PMPs, 

Each of the processors 14, 18, 20, 24, 28 and 30 may be programmable, contain its 
own random access memory and decode the random access memory contents into 
instructions, in one embodiment of the invention. The control of the programmable 
processors is accomphshed by the MPC 18. It controls when a given instruction is executed 

20 by any of the programmable processors. 

Thus, the MPC 18 controls the time of execution of instructions and is the only 
provider of instructions that are clock cycle active. The remaining programmable processors 
run at the same time. 

The register module 16 contains general purpose registers for storing data and 

25 allowing the accessing of data by the programmable processors. The inclusion of a 

programmable random access memory processor 24, programmable input processor 14 and 
the programmable output processor 20 allows very flexible input, output and data 
manipulation/storage operations to take place in parallel with mathematical operations. 

MPC 18 

30 The MPC 18 controls the processors 14, 20, 24, 28 and 30 that may be considered as 

slave processors to the MPC 1 8. Thus, the MPC 18 contains an instruction memory 40 and 
an instruction decode 38, as indicated in Figure 2, The MPC 18 determines when a slave 
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processor can execute instnictions and the slave processors communicate when data reads or 
writes to the external data bus 19 have completed. The MPC 18 is also responsible for 
generating cycle dependent or instmction dependent signals. Examples of such signals 
include interrupts, data tags and the like. 
5 In some embodiments of the present invention, the MPC 1 8 may be the only 

processor concerned with the slave processor timing. The MPC 1 8 may have the ability to 
meter control signals and globally affect the slave processors based on the state of those 
metered signals. The MPC 1 8 may also control when the processors execute a given 
instruction. This module is clock cycle accurate and may be used to control the parallel 

10 operation of an embodiment using VLIW. The MPC 18 contains instruction enables for each 
of the other processors. These instruction enables are used to control when the slave 
processor processes its next instruction. Null operations are performed by not issuing an 
enable during a particular clock cycle. 

The MPC 18 may decode the following instruction types in one embodiment of the 

15 invention: 

ENABLE Instruction for independent control enable of all slave processors. 

RESET_PC Instruction for resetting the program counter for each slave processor's 
.„ instruction memory. 

WAITONX Instruction used to synchronize the MPC and all of its slave processors to 
20 external data from the PIP and POP. 

REPEAT_N Instmction provides two types of repeat branches. REPEAT N times or 

REPEAT forever. 
JUMP_IF ' Instruction is a jump conditional instmction. 

JTJMPN Instmction provides three types of jump branches. JUMP N times, JUMP 

25 forever and JUMP RETURN for function calls. 

RETURN Instruction resets the program counter back to the JUMP RETURN 

instmction plus one. This may be used for function calls. 
In one embodiment of the present invention, the instmction decode may be 
implemented using the following table (with the number of bits per instruction shown in 
30 parentheses): 
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Note 


JVIOX3 10 i^oo 13 il iVlAJr 


uuu 






001 


Reset pc 


RMPC( 1 )&RPIP( 1 )&RPOP( 1 )&RPRPf 1 )&RPMPO 3(4)&ITYPE('3) 


010 


Wait on x 


WAIT POP(l)&WAIT PIP(1)&ITYPE(3) 


on 


Repeat N 


RJEPEAT N(5)&ITYPE(3) 


100 


Jump if 


TAG(2)&INSTR ADDR(9)&ITYPE(3) 


101 


JumpN 


JUMP CNT(4)&INSTR ADDR(9)&ITYPE(3j 


no 


Return 


ITYPE(3) 


111 


Stop ' 


STOP DEBUG(l)&STOP EN(1)&ITYPE(3) 



In the above instruction decode table, USR_DEF provides spare outputs, an example 
of which might be an interrupt. These outputs are registered and one set remains set until 
reset by the same instruction. TAG is used to generate tags for the transmit FIFO register 22. 
The tags can be used for items such as cyclical data. An example of cyclical data is RGB 
5 pixels in an image processing application. These outputs are registered and once set, remain 
set until reset by the same instruction. 

PIP, POP, PRP, PMPO-3 are bits that when set to one enable the execution of an 
instruction for each of the slave processors. ITYPE is the instruction type as defined in the 
above instruction decode table. RMPC, RPIP, RPOP, RPRP, RPMPO-3 are bits that when set 
10 to one reset the corresponding program counter for the respective processor. The program 

counter is used to keep track of the current location of the instruction being executed by each 
processor. 

WAIT_POP and WAIT_PIP cause the MPC 18 to wait until a one is detected from a 
corresponding POP or PIP slave processor. This function can be used to trigger up a block of 

15 data to be operated on or to send out a block of data. REPEAT_N is a value that causes the 
last instruction to be repeated N+1 times. A zero is equal to repeat once, a one is equal to 
repeat twice and so on. If REPEAT__N equals a maximum, then a REPEAT forever will 
occur. The maximum is defined as the maximum value that can be used in a field (e.g. 
REPEAT_N (5) equals "1 1 1 11" or 3 1 decimal). A repeat instruction does not cost a clock 

20 cycle to execute. 

The JUMP_IF instruction operates in the following fashion. In a first step, the first 
encounter of a JUMP_IF instruction arms a comparator. In a second step, the second 
encounter of a JUMP_IF instruction identical to the previous one causes a jump to an address 
if the comparator has detected a match of the tag as armed in the first step. After the jump, 

25 the JUMP_IF is dearmed. In a third step, any second encounter of a JUMP_IF instruction 
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with a different tag than in the first step, changes the comparator and rearms the JUMP_EF 
instruction. 

INST_ADDR is a jump to an instruction address. The jump instruction generally 
takes one cycle. JUMP^CNT is the number of times to jump to the INST_ADDR. For 
5 example, zero equals one jump, one equals to two jumps, etc. If JUMP_CNT is a maximum 
then a JUMP forever may occur. If the JUMP_CNT is equal to a maximum minus one, then ' 
the JUMP_RETURN may occur. 

A STOP_EN is a bit that when set to one stops the MPC 1 8 until a reset or until the 
unit is tumed from off to on. When changing from the off state to the on state, the MPC 1 8 

10 may be reset to the first instmction. A STOP_DEBUG indicates that the MPC 1 8 will stop 
until a toggle from the off state to the on state occurs. At this point, operation resumes at the 
next instruction. This mode may be used for debugging. All the processors can be stopped 
and then all the contents of the registers 32 or the processor RAMs may be read. 

The MPC 1 8 provides a single central master processor to control the operation of the 

15 slave processors. This allows easier portability across different process technologies used to 
make the various processors. As the slave processors are added, removed, modified or 
redesigned to meet timing with different throughputs, only the master processor program may 
change. This avoids the need to completely redesign the entire digital signal processor 10. 
With an instmction assembler that generates the machine code for the master control 

20 processor, this process may be relatively fast and easy. Easy generation of cycle accurate or 
program dependent signals may be accomplished with the MPC 18, Slave processors may be 
removed or added with ease allowing the creation of custom digital signal processors with 
different performances. 

PIP 14 

25 As shown in Figure 3, the PEP 14 includes an instruction memory 46, an instruction 

decoder 44 and a math capability 48. The PIP 14 is capable of implementing addition, 
subtraction and shift left functions on the input data as well as internal data in accordance 
with one embodiment of the present invention. The MPC 1 8 controls the execution of 
instructions by the PIP 14. The PIP 14 signals the MPC 18 when input data reads are 

30 complete. The PIP 14 uses self-timed math modules to execute instructions and math 
functions. j 



i 
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The PIP 14 may send incoming data from a receive FIFO register 12 into the GPR 32. 
The PIP 14 has the ability to add a full signed 16-bit offset and scale up by shifting left the 
incoming or internal data. The PIP 14 also has overflow and underflow error flags that can 
be used by other entities to determine what to do with the data. When operating on internal 
5 data, the PIP 14 may read or vmte data from any GPR 32 register and this mode is clock 
cycle acctirate. 

When operating on data from the input FIFO register 12, the PIP 14 may write data to 
any GPR 32 register. The PIP 14 interfaces with the receive FIFO register 12 with a 
busy/vahd protocol. An instruction done signal may be sent to the MPC 18 whenever the 

1 0 receive FIFO register 12 instruction is completed. 

The PIP 14 may also set up register chains in the GPR 32. For example, in the chain 
mode, data written into a register zero does not destroy data in the register zero. Instead the 
data from the register zero is automatically written to a register one and data from the register 
one is automatically written to the register two and so on until the end of chain (HOC) is 

15 reached. Thus, the register zero is now defined as the start of chain (SOC) because the PIP 
14 is writing to register zero! This may all happen in one clock cycle, allowing fast sliding 
filter operations in both one and two dimensions. 

If a global EOC bit is set equal to one, then any write to the GPR 32 may define a 
vahd SOC. If the global EOC is set to zero, only PIP 14 writes to the GPR 32 are defined as 

20 a vahd SOC. The global EOC mode may be used in Infinite Impulse Response (IIR) filter 
applications. 

The PIP 14 may decode the following instruction types in one embodiment of the 
invention: 

RXFIFO Instruction used to get data from an external source. 

25 INTERNAL Instruction used for routing of data from an external or internal source. 
OFFSET Instruction that can add an offset to incoming data. 

SHIFTLEFT Instruction can shift left incoming data bits. 

EOC Instruction used for global chaining to tell where the end of the chain is 

located. 

30 REPEATN Instruction provides two types of repeat branches—REPEAT N times or 

REPEAT forever. 
JUMP_IF . Instruction is a jump conditional instmction. 
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JUMP_N Instruction provides three types of jiimp branches, JUMP N times, JUMP 

forever and. JUMP RETURN for functional calls. 
RETURN Instruction resets the program counter back to the jump return instruction plus 

one. 

5 The following instruction decode table may be used in one embodiment of the present 

invention: 



ITYPE 


Note 


MSB to LSB BIT MAP 


0000 


Rxfifo 


OUANTY(5)&DEST(5)&ITYPE(4) . 


0001 


internal 


SUBEN(l)&DEST(6)&SOURCE B(;6)&S0URCE A(6)&ITYPE{4) 


0010 


Offset 


OFFSET( 1 6)&ITYPE(4) 


0011 


Shift left 


SHIFT L(4)&ITYPE(4) 


0100 


Eoc 


GLOBAL EOCfl)&EOC(5)&ITYPE(4) 


0101 


Repeat N 


REPEAT N(5)&ITYPE(4) 


0110 


JumpN 


JUMP CNT(6)&INST ADDR(7)&ITYPE(4) 


0111 


Return 


ITyPE(4) 


1000-1111 


Reserved 





QUANTY is a counter that controls the number of words fetched from the incoming 
data source. If set to zero, one word is fetched, if set to one, rwo words at fetched and so on. 
Upon completion of the total amount to be fetched, the PIP 14 signals the MFC 18 that the 
10 instruction has been executed. This instruction stays asserted until the next instruction is 
enabled. 

DEST is the destination address to the GPR. ITYPE is the instmction type as defined 
in the instruction decode table. SUBEN is a bit that when set equal to one enables a 
subtraction as follows: DEST=SOURCE_A-SOURCE_B. When set to zero, SUBEN 
15 . enables an addition as follows: DEST=SOURCE_A-fSOURCE_B. SOURCE_A is the 

address of the A input to the adder/sub tractor. SOURCE_B is the address of the B input to 
the adder/subtractor. 

An OFFSET is the amount in signed format added to the input data as follows: 
DEST_GRP = input data + offset. Subtraction occurs if the value is negative. SHIFT _L is 
20 the amount of shifts left performed on the input data. Decoding may be as follows: 0000 
means no left shifts, 0001 means one left shift, ...,1000-1 1 1 1 means eight left shifts. 

GLOBAL_EOC is a bit that when set to one enables the global EOC mode. In this 
mode, the processors that write to the GPR zero register will be considered the SOC. Wh^ 
this mode is not active, only the PIP 14 writes to the GPR define the SOC. EOC is the end of 
25 chain address for the GPR. The SOC is always defined as the DEST_GPR or DEST if it 
addresses the GPR. If the EOC is 000000 then no chaining of the GPR occurs. When the 
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EOC is less than or equal to the SOC, no chaining occurs. When the EOC is greater than the 
SOC, then register chaining occurs. 

REPEAT_N, JUMP_CNT, INST_ADDR and RETURN are all as defined for the 
MPC 18. 

5 Through the use of the PIP 14, data may be transferred from working registers in a 

programmable fashion. Data input and other slave operations may occur independently of 
each other. The PIP 14 allows writing data to a destination that may not be ready for data. 
While not perfomiing data transfers, the PIP 14 can be used to perform internal math 
functions. 

10 POP 20 

The POP 20 has an instruction memory 54, an instruction decode 52 and a math unit 
56 as shown in Figure 4. The POP 20 is capable of implementing addition, subtraction, shift 
right, ceiling, flooring, absolute value and round functions on internal data in accordance with 
one embodiment of the present invention. The POP 20 also has overflow and underflow 
15 flags. The MPC 18 controls the execution of instructions by the POP 20. The POP 20 

signals to the MPC 18 when output data writes are complete. The POP 20 uses self-timed 
math modules to execute instructions and math functions. 

The POP 20 is responsible for sending data to the transmit buffer 22 from any of the 
GPR 16 registers while in the transmit data mode of operation. While in the internal mode, 
20 the POP 20 may send and receive data from any of the GPR 1 6 registers. The internal mode 
of operation occurs when the POP is not sending data to the transmit FIFO register 22. 

The transmit data mode of operation may not be clock cycle accurate where the 
internal mode of operation may be clock cycle accurate. Transfers to the transmit FIFO 
register 22 may not be clock cycle accurate because the register 22 may be fiill. However, 
25 the POP 20 can signal the MPC 1 8 when an instruction is finished being executed. 

In one embodiment of the present invention, the POP 20 may perform 16 bit signed 
addition, subtraction, shift right operations with rounding, maximum clamping, minimtmi 
clamping and absolute value determinations. The shift right operation may round to the least 
significant bit. For example, if 16 bits are shifted right eight bits, then rounding occurs on the 
30 lower eight bits, affecting the upper eight bits. Rounding will not occur if rounding causes 
overflow or underflow of the original 1 6-bit signed data. The order of operations on data 
may be as follows: add/subtract, roundy'shift, max clamp/min clamp, absolute value. 
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A configuration register enables the absolute value function, minimum threshold and 
maximum threshold. Two 16-bit configuration registers store the signed maximum and 
minimum threshold values. These configuration registers may be changed by fiimiware or by 
instruction. 

The POP 20 may use the follow^ing instruction types in one embodiment of the present 



invention: 




TXFIFO 


Instmction used to send data to an external source. 


INTERNAL 


histruction used for routing of data. 


OFFSET 


Instruction to add an offset to incoming data. 


SHIFT RIGHT 


Instruction used to shift right outgoing data bits. 


MAX 


Instruction to clamp maximum data allowed. 


MIN 


Instruction to clamp minimum data allowed. 


REPEAT_N 


Instruction for two types of repeat branches as described previously. 


JUMPJF 


Instruction for a conditional jimip. 


JUMP_N 


Instruction for the three types of jump branches. 


RETURN 


Instruction to reset the program counter. 



The following example illustrates the instmction decode in one embodiment of the 
present invention: 



ITYPE 


note 


MSB to LSB BIT MAP 


0000 


tx fifo 


QUANTY(5)&S0URCE GPR(5)&ITYPEf4) 


0001 


internal 


SUBEN(l)&DESTf6)(S:SOURCE Bf6)&SOURCE A(6)&ITYPE (4) 


0010 


Offset 


OFFSET( 1 6)&ITYPE(4) 


001 1 


Shift rieht 


SHIFT R(4)&ITYPE(4) 


0100 


Max 


CLAMP MAX{I6)&ITYPE(4) 


0101 


min 


CLAMP MIN(16)&ITYPE(4) 


0110 


Repeat N 


REPEAT N(5)&ITYPEC4) 


0111 


JumpN 


JUMP CNT(6)&INST ADDR(7)&ITYPE(4) 


1000 


Return 


ITyPE(4) 


1001 


abs en 


ABS EN(1)(S:ITYPE(4) 


1010-1111 


reserved 





QUANTY is a coimter as defined in connection with the PIP 14. SOURCE__GPR is 
the start address to the GPR 16. Data of quantity N is fetched starting at the SOURCE^GPR 
address and automatically incremented until the quantity N has been sent to the transmit 
FIFO register 22. As a result, a row of N data may be sent with one instruction. 

ITYPE is the instruction type as defined in the instmction decode table set forth 
above. OFFSET, SUBEN, SOURCE_A and SOURCE^B have the same definitions as set 



wo 01/44964 PCT/USOO/28630 

- 12- 

forth previously with respect to the PIP 14. SHIFTER is the amount of shift rights performed 
on input data up to eight right shifts. The decoding is as follows: 0000 is no right shifts, 0001 
is one right shift, 1000 to 1 1 1 1 is eight right shifts. Rounding of the bits that fall off as a 
result of a shift right is also performed. 
5 CLAMP_MAX is the maximum 16 signed value output and CLAMP_MIN is the 

minimum 16 bit signed output. REPEAT__N, INST_ADDR, RETURN and JUMP_CNT are 
as defined in cormection with the PIP 14. ABS_EN is a bit that when set to one enables the 
absolute value fimction to be performed on the final output after the clamping function. 

With the POP 20, data may be transferred from working registers in a programmable 
10 fashion. Data output and other slave operations may occur independently of one another. 

The POP 20 also allows the data to be written to a destination that may not be ready for the 
data. When not performing data transfers, the POP 20 can be used to perform internal math 
fiinctions. 

Registers 16 

15 The registers 16 include a bus interface 34 and N general purpose registers 32 

configured to allow the chaining and global chaining modes as well as independent read and 
write operations that can occur from a number of processors at the same time. The GPR 32 
allows independent data transfers firom and to any of the other processors. The GPR 32 
includes registers for each of the processors that can be written to by any processor module. 

20 If two processor modules try to write to the same register, an error flag is set. 

In chaining mode, the GPR 32 may be configured to chain one register's output to that 
of another register. The data written to a register 0 firom the PIP 14 does not destroy data in 
the register 0, but instead the data fi-om the register 0 is automatically written over to register 
1 and data fi-om the register 1 is automatically written over to the register 2 and so on in one 

25 clock cycle until a programmable EOC is reached. Thus, as shown in Fig, 5, when data is 
written into GPR zero indicated at 58, the data is automatically transferred to GPR one 
indicated as 60 and so on. 

The SOC is defined as the present GPR location that the Pff 14 is writing to. This 
allows fast Finite Impulse Response (FIR) filters as well as fast shding filter operations in 

30 both one and two dimensions. As an example, if SOC is set to six, a write from the PIP 14 to 
GPR six in this example produces no chaining. Allowing only the PIP 14 to define the SOC 
allows fetching of the next set of data in order while the last set is being operated on without 
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using register to register move instructions. When contention occurs, any Avrites to any GPR 
by a processor take precedence over chaining rights from the previous register. 

The use of a global data chaining, shown in Fig. 6, allows data to be processed more 
efficiently when implementing IIR filters. Global data chaining is defined as allowing 
internal math modules to form the SOC. This allows computed data to generate an SOC as 
opposed to allowing only the PIP 14 to form a data chain. In the global chaining mode, the 
PIP 14 cannot define the SOC. If the global chaining mode is active, then any write to a GPR 
from any processor except the PIP 14 can define a valid SOC. When performing TTR 
filtering, the SOC may be defined by other processors because the input data may be operated 
on before insertion into the chain. When contention occurs, any writes to the GPR by a 
processor take precedence over chaining rights from the previous register. 

Referring to Figure 13, chaining code 106, which may be stored in association with 
the MPC 1 8, begins by determining whether global chaining has been selected as indicated at 
diamond 107. If global chaining has been selected, a check determines whether the PIP 14 
has written to a general purpose register 0 as indicated in diamond 108. If so, GPR 0 is set 
equal to the new word and no chaining is indicated. If the PIP did not write to GPR 0, a 
check at diamond 112 determines whether there is any other processor write to register GPR 
0. If so, the SOC is set equal to zero and the HOC is set equal to a programmable value 
greater than or equal to the SOC. GPR 0 is set equal to the new word and GPR 1 is set equal 
to GPR 0, and GPR 2 is set equal to GPR 1, and GPR (EOC) is set equal to GPR (EOC-1) as 
indicated in block 114. If GPR (EOC) equals GPR 0, then no chaining takes place. A write 
to a GPR always occurs. 

If global chaining has not been selected, then a check at diamond 1 1 6 determines 
whether the PIP 14 has written to a GPR register X (GPR(X)). If not, a check at diamond 
120 determines whether there are any other writes to the register X. If so, the register X is set 
equal to the new word and no chaining occurs, as indicated in block 122. 

If the PIP did write to the register GPR (X) then the SOC is set equal to the GRP(X) 
as indicated in block 118. The EOC is set equal to a programmable value greater than or 
equal to die SOC. The GPR (X) is set equal to the new word and the register GPR (X+1) is 
set equal to GPR (X), GPR (X4-2) is set equal to GPR (X+l) and GPR (EOC) is set equal to 
the register GPR (EOC-1). If the register GPR (EOC) is less than or equal to the register 
GPR (X), then no chaining takes place but a write to GPR (X) always occurs. 
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The shift &om one register to another may take place in one clock cycle. For 
example, the shift from register one to register zero, the shift from register two to register one 
and the shift from register three to register two and so on, all may occur in one clock cycle, in 
one embodiment of the present invention. 

Use of chaining and global chaining modes allows independent data processing to 
occur in any of the processors. In addition it may allow faster IIR filters, FIR filters, sliding 
N dimensional filters and vector products, without the need for a large number of register to 
register instructions. 

PRP 24 

The PRP 24 includes a number of random access memory (RAM) modules 74, as 
shown in Figure 7, The number of modules 74 equals the number of sub-processors that use 
the PRP 24. Thus, the N RAMs 74 are coupled to an instruction decode unit 72 which in turn 
is coupled to an instruction RAM 76. The N RAMs 74 can be programmed to read and write 
based on instructions contained in the instmction memory 76. Each RAM 74 is able to read 
and write independently of the others. 

The PRP 24 may allow internal storage of N 16 bit data blocks in one embodiment of 
the invention. The PRP 24 may be used for filter operations where data is used recursively or 
data flow would be too restrictive on performance. An example is performing two 
dimensional discrete cosine transforms (DCT) where the filtering is performed on eight 
columns and then on eight rows. Another example is in direct storage of quantization tables 
and memory so that the zigzag operations and quantization may take place at the same time. 

The PRP 24 may read or write to any GPR 1 6 register. Since the PRP 24 contains N 
separate memories 74, N reads or writes or a mix may take place at the same time. The 
firmware has direct access to all of the N RAM memories 74. 

The PRP 26 may decode the following instruction types in one embodiment of the 
invention. A RDAVR is an instruction that provides independent read/write control of the 
multiple RAMs 74. A JUMPN is an instruction that provides three types of jump branches as 
described previously and similarly a RETURN operates as described previously in 
connection with other processors. 

As an example of one embodiment of the invention, tlie following instruction decode 
table is provided for the PRP 24: 
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ITYPE 


note 


MSB to LSB BIT MAP 


00 


rd/'wr 


B EN(1)&B WRJEN(1)&B ADDR(6)&B RAMADDR(7)&A EN(l)&A_UTl£N( 
D&A ADDRC6)&A RAMADDR(:7)&ITYPEr2) 


01 


lumpN 


JUMP CNT{6)INST ADDR('10)&ITYPE(2) 


10 


return 


ITYPEC2) 


11 


reserved 





In the above instruction decode table, B_EN is a bit that when set to one enables a read 
or write from a RAM B. When set equal to zero, it disables a read or write to the RAM B. 
B_WREN is a bit that when set to one causes a write to RAM B if B_EN is one. When the bit 
B__WREN is set equal to zero, it allows a read to a RAM B if B_EN is set equal to one. 
5 B_ADDR is the address of a destination for a RAM read and the address of the source 

for a RAM write. B___RAMADDR is the RAM address for a read or write. Similarly, A_EN 
is a bit that if set to equal to one enables a read or write from a RAM A and when set equal to 
zero disables a read or a write from RA^I A. A_WREN, A_ADDR and A_RAMADDR are 
the same as B_WREN, B_ADDR or B_RAMADDR except they apply to the RAM A. 
10 ITYPE is the instruction type defined using the two least significant bits. 

JUMP__CNT, INST_ADDR, and RETURN are the same as described previously in 
connection with other processors. 

The PRP 24 allows coefficients and data to be saved in a local RAM so as to be 
available for data processing without reading data or coefficients from an external device. 
15 This may reduce input/output performance degradation during signal processing. 

The use of a RAM to store a large number of coefficients results in the use of a 
smaller area than using registers. All RAM reads and writes are controlled by instructions 
and can operate independently of other slave processor operations, in one embodiment of the 
invention. This may allow faster signal processing. 

20 PMPs 28. 30 

The PMPs 28 include an instruction decode unit 80, an adder/subtractor 84 and an 
instmction.RAM 82 as shown in Figure 8. A PMP 28 performs addition or subtraction of two 
inputs and sends the result to the GPR 32. The source and destinations are defined by the 
instmction. The processor executes the instructions using self-timed mathematical modules. 
25 The main function supported by a PMP 28, in one embodiment of the invention, is to 

add or subtract two signed 16 bit numbers and output a 16 bit signed result. The PMP 28 also 
has overflow and underflow flags. The PMP 28 can receive data from any of the GPR 32 
registers and can provide data to any of the GPR 32 registers. 



1 
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In one embodiment of the invention, the PMP 28 decodes any of the following types 
of instructions. An add/subtract instruction provides control of where two inputs to the 
adder/subtractor come from and where the result goes and whether the processor is in the add 
or subtract mode. The REPEAT N, JUMPN and RETURN instmction types are as described 
previously. 

As an example of an instmction decode set for a PMP 28, the following table is 
provided: 



ITYPE 


note 


MSB to LSB BIT MAP 


00 


Add/sub 


SUBEN(1)&DEST(6)&S0URCE B(6)&SOURCE A(6)&ITYPE (2) 


01 


Repeat N 


REPEAT N(5)&ITYPE(2) 


10 


jumpN 


JUMP CNT(6)&INST ADDR(7)&ITYPE(2) 


11 


return 


ITYPE(2) 



SUBEN, SOURCE_A, SOURCE__B, ITYPE, REPEAT_N, JUMP_CNT, 
INST_ADDR and RETURN are all as described previously, for example in connection with 
the PIP 14. 

The use of a PMP 28 allows addition and subtraction operations to occur 
independently of other processors. Since the PMP 28 is fully modular in design, it allows 
scalablility of the overall digital signal processor 10. 

The PMP 30 is a multiply and accumulate (MAC) processor with its own instruction 
memory 90, instruction decode 88, and math module 92 as shown in Figure 9. The processor 
performs multiply and accumulate operations, and sends the results of its operations to a GPR 
32 register. The source and destination are defined by the instruction. The processor may 
execute instructions using self-timed mathematical modules. 

The main supported function of the PMP 30 is to multiply two signed 16 bit numbers 
to produce a 32 bit result. The result may be added from a previous 32 bit result to foma a 
multiply and accumulate (MAC) function. The accumulator size is 32-i-N bits allowing for 
internal extended precision operation. The results of the accumulator are roimded to 16 bits 
and shifted right 16 bits to produce a signed 16 bit result. The PMP 30 also has overflow and 
underflow flags. The PMP 30 may receive data from any GPR 32 and may provide data to 
any GPR 32. 

The PMP 30 decodes the following types of instmctions in one embodiment of the 
invention. An MAC instmction provides control of where the inputs to the multiply 
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accumulate slave processor come from and where the results go to as well as a clear bit. 
REPEATN, JUMPN and RETURN are as described previously. 

An example of the implementation of the above instruction types is set forth in the 
following instruction decode table: 



ITYPE 


note 


MSB to LSB BIT MAP 


000 


MAC 


Spare(I)&CLR(l)&DEST(6)(S:SOURCE B(6)&SOURCE A(6)&ITYPE(3) 


001 


Repeat N 


REPEAT N(5)&rrYPE(3) 


010 


jumpN 


JUMP CNrr(6)&INST ADDR(7)&ITYPE(3) 


Oil 


return 


ITYPEO) 


100-111 


reserved 





In the above table, CLR is a bit that when set to zero forces a feedback loop of the 
32+N bit accumulator to zero. This bit is usually asserted at the beginning of a set of 
accumulation calculations to initialize the accumulator. If the bit is set equal to one for N 
multi-cycles, the MAC operates as a multiplier. DEST is the destination address of the 
accumulation as defined by the source and destination memory map. SOURCE_A is the 
address of the A input to the multiplier and SOlJRCE_B is the address of the B input to the 
multiplier. 

ITYPE, REPEAT^N, JUMP_„CNT, INST_ADDR, and RETURN axe as described 
previously. 

The PMP 30 allows multiply or multiply and accumulate operations to incur 
independently of other processors. The processor is fully modular allowing scalability of the 
digital signal processor 10. 

Data Path Interface 

. Referring to Fig. 10, a data path interface method used in the digital signal processor 
10, allows self-timed single cycle or multi-cycle math processors to be interchanged without 
affecting the instruction decode. This enables greater portability of the processor 10 to 
different process technologies. The interface also allows the implementation of self-timed 
instructions that are dependent only on math processor timing delays. 

All data inputs to the arithmetic elements 94 (such as a PMP 28 or 30) are registered 
outputs and stable until other data values are presented. A valid input signal (IN^VALID) is 
provided when new data is suppUed. New data is only supplied if a busy signal firom an 
arithmetic element 94 is not asserted. 

In a pipeline element 96, shown in Figure 1 1, the busy signal may always be low 
because data can be continuously fed to the arithmetic element 96. A delay element 100 may 
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be used to delay the operation of the arithirietic element by as much as one clock cycle. For 
example, the math processor may be divided into units 102 and 104 with a delay element 100 
in between. ' Similarly, the input valid signal (IN_VALID) may be delayed by one cycle 
delaying the output valid (OUT_VALID) by one cycle. Likewise, the data direction signal 
5 (IN_DEST_ADDR), which tells where the data may go, may be delayed. 

In a multi-cycle arithmetic element 98, shown in Figure 12, the busy signal can be 
used to hold off new data from being sourced to the arithmetic element 98. Destination 
addresses for the result of the mathematical operation and mode change signals may be 
supplied to the arithmetic element 98 to help stabilize it until new data is present. 
10 The arithmetic element 98 provides internal delays to match the latency of the 

arithmetic such that multi-cycle operation occurs. The operation may be spread over two or 
more clock cycles. The input data valid (IN_VALID) and input destination 
(IN_DEST_ADDR) signals may also be delayed the needed number of cycles (N). Error flag 
signals are provided and registered by the arithmetic element. The input data valid signal to 
15 qualify the input data, and the mode or control signals to the arithmetic element 98 are 
asserted for new data sent to the arithmetic element. 

These interface methods allow a digital signal processor to transcend different process 
technologies. In some cases the sole redesign needed for a new function may be to redesign 
the math modules. Instructions set in instruction decode logic need not be changed to 
20 accommodate different arithmetic element timing changes. This allows a more portable 
design amenable to different process technology changes. 

Through the use of pipeUned or multi-cycled processes, different mathematical 
processors may be added to the overall processor regardless of whether they require more or 
less time than the processor which they replace. Thus, in cases where a slower math 
25 processor is replacing a faster math processor, a pipelined or a multi-cycled architecture may 
be utilized to compensate for the additional delay time. Conversely, if the new math 
processor is faster than the one which it replaces, the faster math processor may be used 
without change except as described hereinafter. 

In each case, the MFC 18 is recompiled to adjust to the slower or faster timing of a 
30 new math processor. Regardless of whether the new timing is longer or shorter, all that is 
needed is to recompile the MFC 18. The MFC 1 8 then operates with the new timing. Thus, 
the system may be easily and quickly adapted to new processors which are made on different 
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process technologies and which may be faster or slower than the processor for which the 
system was originally designed. 

While the present invention has been described with respect to a limited number of 
embodiments, those skilled in the art will appreciate numerous modifications and variations 
therefrom. It is intended that the appended claims cover all such modifications and variations 
as fall within the true spirit and scope of this present invention. 
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1 LA digital signal processor comprising: 

2 a mathematical processor; 

3 an input processor that processes input signals to the digital signal processor; 

4 an output processor that processes output signals from the digital signal 

5 processor; 

6 a master processor that controls said rnathematical processor, said input 

7 processor and said output processor; and 

8 a storage selectively accessible by each of said processors. 

1 2. . The digital signal processor of claim 1 further including a random access 

2 memory processor that stores intermediate calculation results. 

1 3. The digital signal processor of claim 2 including a bus coupling each of said 

2 processors to said storage. 

1 4. The digital signal processor of claim 1 wherein said input and output 

2 processors also implement mathematical operations. 

1 5. The digital signal processor of claim 1 wherein each of said processors have 

2 their own instructions sets. 

1 6. The digital signal processor of claim 1 wherein said processors communicate 

2 with one another through said storage. 

1 7. The digital signal processor of claim 1 wherein each of said processors use 

2 very long instruction words. 

1 8. The digital signal processor of claim 1 wherein said master processor provides . 

2 the timing for the other processors. 
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9. The digital signal processor of claim 1 wherein said master processor waits for 
the input processor to complete a given operation. 

10. The digital signal processor of claim 1 wherein each of said processors 
includes its own random access memory. 

1 1 . The digital signal processor of claim 1 wherein said storage includes a 
plurality of registers, said registers automatically transfer existing data from a first register to 
a second register when new data is being written into said first register. 

12. The digital signal processor of claim 1 1 wherein said input processor causes 
the automatic transfer of data. 

13. The digital signal processor of claim 1 1 wherein said mathematical processor 
causes said data to be transferred from one register to another. 

14. The digital signal processor of claim 1 including a mathematical processor 
which is pipelined. 

15. The digital signal process of claim 1 wherein said mathematical processor is a 
multi-cycled mathematical processor. 

16. A method of digital signal processing comprising: 

using a first processor to process input signals to said digital signal processor; 
using a second processor to process output signals from said signal digital 
signal processor; 

using a third processor for mathematical operations; 

controlling said first, second and third processors using a fourth processor; and 
enabling each of said processors to selectively access a storage. 

17. The method of claim 16 including providing the timing from said fourth 
processor for each of the other processors. 
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1 18. The method of claim 16 including automatically transferring data from a first 

2 register in said storage to a second register in said storage when new data is being written into 

3 said first register. 

1 19. The method of claim 18 including automatically transferring said data in 

2 response to action by said first processor. 

1 20. The method of claim 1 8 including automatically transferring said data in 

2 response to action by said third processor. 
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